Abstract Efficient clustering of big data is a requirement
given the massive volume of data generated in today's
digitalized, connected world. Traditional clustering
algorithms do not scale to massive and highly unstructured
big data. Thus, clustering big data efficiently requires a
new architecture and programming paradigm. In this work, a
novel MapReduce-based Fuzzy C-Medoids clustering algorithm
is designed and experimented with to cluster a big data
repository of document datasets. The performance of the
proposed algorithm is experimentally evaluated for different
Hadoop cluster sizes and different-sized document datasets.
The algorithm is found to be scalable and efficient in
performing clustering jobs.


Introduction


Big data is a term used to distinguish a dataset from other
datasets based on features such as massive size,
unstructured or semi-structured data, noise in the data, and
the high speed at which the data is generated [1]. Document
datasets are an example of big data due to their completely
unstructured nature and high dimensionality.

Clustering is a well-known and widely used data mining task
which groups the data objects of a dataset into multiple
groups based on the closeness of the objects' attributes
[2]. Big data clustering for document datasets is mainly
used for the automatic grouping of documents based on the
appearance of similar unique words in the documents [3].
Document clustering is applied for various purposes, such as
depicting documents in a hierarchy, searching indexed
documents, information filtering, and arranging web
documents in search engines [4].

A popular clustering algorithm is K-Means [5]. It is a
partitioning algorithm that runs for i iterations; it takes
n objects as input, with k objects randomly chosen from the
n as initial centroids.

The K-Means algorithm works as follows:


1. Select the number of resulting clusters (K) and choose K
initial data objects as the initial centroids

2. Repeat steps 3 and 4 (for a predefined number of
iterations or until the convergence criteria are met)

3. For each data object:

   find the nearest centroid
   allocate the data object to that cluster

4. For each cluster: new centroid = mean of the data objects
allocated to that cluster

5. End
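
The five steps above can be sketched in a few lines of Python. This is only an illustrative, single-machine sketch: the function name `kmeans`, the Euclidean distance, the fixed random seed and the toy points are assumptions made for the example and are not taken from the paper.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data objects as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Step 3: assign every object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 4: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        # Step 2: stop early once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = kmeans(points, k=2)
```

On these two well-separated groups the centroids settle on the two group means within a few iterations.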


Fuzzy C-Means is an extension of K-Means with fuzzy logic
[6]. Fuzzy C-Means is a soft clustering technique where a
membership value between 0 and 1 is calculated to determine
the proximity of an object to each centroid. The centroid
recalculation is performed using the objects and their
corresponding membership values for a centroid using the
fuzzy formula [7]. The time complexity of fuzzy C-Means is
higher than that of K-Means, but it provides better-quality
clusters [8]. A modification of fuzzy C-Means is fuzzy
C-Medoids, where the object nearest to a newly calculated
fuzzy C-Means centroid is selected as the new centroid and
replaces the older centroid. Fuzzy C-Medoids produces
quality clustering results and is less sensitive to outliers
than other partitioning algorithms [9, 10].

Traditional clustering algorithms are becoming inefficient
for clustering big data due to its massive size and many
dimensions [5]. Recent works show the need for, and benefits
of, modifying traditional clustering algorithms using recent
programming paradigms and computing architectures. A few
recent studies show that MapReduce-based algorithms and
their variants are more efficient for clustering when
executed over the Hadoop distributed computing framework
[11-13].

This paper presents the design of a novel MapReduce-based
fuzzy C-Medoids algorithm and an analysis of its execution
over a Hadoop cluster. The main contribution of this work
lies in the experimentation of the proposed algorithm on big
data of document type for clustering. The datasets have
10,000 dimensions and are successfully clustered using the
fuzzy logic-based MapReduce enhancement executed over the
Hadoop distributed architecture. The size of the datasets
used for the experiments ranges from 100 MB to 1 GB, in
addition to a standard dataset, and the Hadoop cluster size
varies up to 10 nodes. The novelty of the work is that no
work reported in the literature provides such a detailed
design of a MapReduce-based fuzzy C-Medoids algorithm
together with such large-scale experiments on many datasets,
each with 10,000 dimensions, and on different Hadoop
clusters. The proposed algorithm is found to be efficient
and effective in clustering large datasets.


Literature Review 


The K-Means algorithm has been used for document clustering
for a long time [14-16]. K-Means is reported to be
inefficient for clustering massive document datasets
[17, 18]. To improve the quality of clustering, a fuzzy
logic-based K-Means algorithm was developed [19]. The
K-Medoids algorithm is an extension of the K-Means algorithm
where the K-Means generated centroids are replaced with the
nearest objects from the dataset. Thus, the K-Means
generated centroids are values in the space, whereas the
K-Medoids generated centroids are real data objects. The
terms C-Means and K-Means and, similarly, C-Medoids and
K-Medoids are used interchangeably in the literature, and
this is reflected in our literature survey as well.

Fuzzy C-Medoids is a modification of fuzzy C-Means in terms
of centroid recalculation. Both algorithms calculate the
membership matrix in the same way for each centroid and
object pair. The centroid is then recalculated in two
stages:


1. Fuzzy C-Medoids first uses the fuzzy C-Means formula to
recalculate the centroids.

2. Step (1) produces new centroids (as per the fuzzy C-Means
algorithm), which are then replaced with actual objects from
the dataset (known as medoids). These medoids are obtained
by finding the nearest object in the dataset for each newly
calculated centroid.


The capability of these algorithms to cope with big datasets
such as documents is a concern. To obtain the needed
efficiency, these algorithms are being adapted for recent
architectures [20]. A few works reported in recent
literature present MapReduce-based K-Means [21, 22]. The
literature provides a few modifications of fuzzy C-Means for
clustering document big data. Many of the modifications are
merely optimizations of traditional fuzzy C-Means. Only a
few MapReduce-based fuzzy C-Means algorithms are reported in
the literature. The number of works reported on
MapReduce-based K-Means is quite high compared to
MapReduce-based fuzzy C-Means. This literature survey mainly
focuses on the most noted works on traditional as well as
MapReduce-based fuzzy C-Medoids modifications for document
datasets.

In [23], the algorithm adds the k value under constraint
conditions and needs only one iteration to get clustering
results for ontology data. It addresses the randomness of
cluster center selection and improves efficiency in
achieving global optimization. The results demonstrate that
the precision ratio and efficiency of the improved algorithm
rise considerably. In total, 500 documents of RDF data are
used for the experiments.

In [24], web documents and snippets are clustered with fuzzy
C-Medoids, which provides good clustering results. The
author claims that the complexity compares very favorably
with other fuzzy algorithms for relational clustering. The
algorithms converge very quickly, in 5 or 6 iterations. The
document dataset is made from 1042 abstracts from the
Cambridge Scientific Website.

In [25], K-Medoids was applied to classify the semantics of
English documents using the Cloudera distributed environment
of Hadoop MapReduce. Mappers perform the assignment of data
vectors to centroids, and reducers recalculate the centroids
and then find the nearest medoids. The authors claim this is
the first work on MapReduce-based K-Medoids. In total,
2,000,000 English documents were used for the experiments.

In [26], parallel K-Medoids is implemented in two phases:
parallel seeding and parallel refinement. The first phase
performs a global search over sample data, and the second
phase performs a local search over the entire data.
Experiments are conducted on Spark as well as Hadoop using
various real-world datasets on 12 Microsoft Azure machines
(48 cores). The results show that it significantly
outperforms most of the recent parallel algorithms while
producing quality clustering. The datasets used in the
experiments are small-sized standard datasets.

Based on the methodologies and clustering results obtained
from the above-mentioned works, we can derive the following
points:


• The documents should be preprocessed properly before being
input for clustering. It is reported that tools like Mahout
sometimes provide preprocessed document datasets that are
not effective in clustering [27].

• Fuzzy C-Medoids for document clustering has been designed
and experimented with in only a few cases.

• Fuzzy logic-based clustering usually takes a long time to
provide clustering results on traditional architectures,
even with small- and moderate-sized document datasets
[28-30].

• There are only a few reported works on document clustering
using MapReduce-based fuzzy C-Medoids. These works provide
fewer experimental details and simpler algorithm design
specifications, and use only pseudo-distributed Hadoop
installations and smaller document datasets for performance
analysis [31].

• The MapReduce-based design of traditional algorithms is
accomplished in each case such that the mappers compute the
parallel parts of the algorithm, while the reducer computes
the serial parts.


The above observations derived from the literature survey
lead us to the following open areas in clustering document
big data using MapReduce-based fuzzy C-Medoids:


• Although a very small number of works on MapReduce-based
fuzzy C-Medoids are reported in the literature, no work
provides the algorithm design with a figure-based
representation and a detailed explanation of the mapper and
reducer parts. In this paper, we provide an in-depth design
of the MapReduce-based fuzzy C-Medoids in the methodology
section. The reason for choosing one part as the mapper and
another part as the reducer is also justified.

• There is no work on MapReduce-based fuzzy C-Medoids that
experimented on document big data with high dimensions and
with a combination of standard and self-crawled datasets. In
this work, 1 standard and 5 self-crawled datasets of varying
sizes are used for a better understanding of the results.
All these datasets are preprocessed with 10,000 dimensions
and then clustered using the proposed algorithm.

• No work in the literature has experimented with such an
algorithm on Hadoop distributed platforms of variably sized
clusters. In this work, a Hadoop cluster of 10 nodes is used
along with 8-, 5- and 3-node clusters, and a
pseudo-distributed Hadoop cluster is also used. This is to
determine the performance gain at each stage of Hadoop
cluster enlargement.


Methodology 


This section provides the design of the proposed
MapReduce-based fuzzy C-Medoids. It also provides brief
details of the MapReduce framework and documents the big
data preprocessing steps used in this work.


MapReduce Paradigm and Hadoop 


MapReduce is a processing technique used in the distributed
computing framework named Hadoop. Hadoop works in a
master-slave architecture where a master node starts and
monitors data splitting and transfer, manages node failures
using replicated data splits, and transfers and collects
computing results from the slaves. Hadoop has its own file
system named HDFS, which deals with all storage-related
management between masters and slaves. Similarly, the
processing part of Hadoop is handled by the MapReduce
framework. MapReduce performs an individual job using two
functions: map and reduce. The map tasks (mappers) usually
far outnumber the reduce tasks (reducers). A mapper works in
each slave node and mainly processes the part of the task
that can be executed in parallel. The mappers provide their
output to the reducers. The reducers, after executing the
part of the task assigned to them, provide the final output.
The input and output of mappers and reducers are <key,
value> pairs. It is recommended to use separate nodes for
map and reduce, especially when a large dataset is being
analyzed by an algorithm. Figure 1 shows the style of
MapReduce execution on the Hadoop cluster.
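
The <key, value> flow described above can be imitated on a single machine with a few lines of Python. The sketch below is not Hadoop code: `mapper`, `reducer` and `run_job` are hypothetical names, the shuffle and sort phase is replaced by an in-memory dictionary, and word counting is used only because it is the simplest job to follow.

```python
from collections import defaultdict

def mapper(offset, line):
    """Emit a <word, 1> pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def reducer(key, values):
    """Combine all values that share a key into one output pair."""
    return key, sum(values)

def run_job(records):
    # Map phase: each record is processed independently (parallelizable).
    intermediate = []
    for offset, line in records:
        intermediate.extend(mapper(offset, line))
    # Shuffle and sort phase: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reducer(key, grouped[key]) for key in sorted(grouped))

# Keys are byte offsets and values are lines, as in Hadoop text input.
records = [(0, "big data clustering"), (20, "fuzzy clustering of big data")]
counts = run_job(records)
```

In a real cluster the map calls run on different slave nodes and the grouping is done by the framework; only the two functions are written by the programmer.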


Fig. 1 Data processing in MapReduce on Hadoop


Document Big Data Preprocessing


Fuzzy C-Medoids requires numerical values as input to
produce clusters. The document big data is thus preprocessed
using the following steps [32]:



• Tokenization: The unique words (terms) across the text
files of the document dataset are retrieved to create a
dictionary as per the bag-of-words model.

• Weighting: This step extracts the features of each word in
the dataset by assigning a weight value to each term in the
dictionary. The weighting is performed by the term frequency
* inverse document frequency (tf*idf) method. TF specifies
the frequency of occurrence of a word in a particular text
file of the dataset. Considering tf alone cannot provide a
correct result, as common words like “the,” “a,” etc., occur
more frequently in a text file, but this does not mean that
these common words are of more importance in the text file.
To get a more accurate weighting of a term, tf is multiplied
by idf. IDF is calculated based on the frequency of a term
in a text file normalized to its presence in the entire
dataset. As term occurrences are distributed exponentially,
a logarithm of the weight is calculated for a better
weighting of each unique term.


The preprocessing assigns a unique number to each term and a
value that represents its weight across the entire dataset.
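
A minimal sketch of the two preprocessing steps is given below. The paper does not state the exact tf*idf variant, so the formulas here are assumptions: tf is the term count divided by the document length, and idf is log(N / df), where N is the number of documents and df the number of documents containing the term. The function name `tfidf` and the three toy documents are also illustrative only.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return the sorted dictionary and one {term_id: weight} map per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Tokenization: dictionary of unique terms, each given a numeric id.
    dictionary = sorted({term for tokens in tokenized for term in tokens})
    term_id = {term: i for i, term in enumerate(dictionary)}
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weighting: assumed variant, (count / length) * log(N / df).
        vectors.append({
            term_id[t]: (count / len(tokens)) * math.log(n_docs / df[t])
            for t, count in tf.items()
        })
    return dictionary, vectors

docs = ["the match was a thriller", "the budget was announced", "the team won the match"]
dictionary, vectors = tfidf(docs)
```

Because “the” occurs in every document, its idf is log(1) = 0 and its weight vanishes, while a rare term such as “thriller” receives a larger weight than “match,” which appears in two documents.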


Proposed MapReduce Fuzzy C-Medoids 


This work modifies the traditional fuzzy C-Medoids algorithm
using the MapReduce paradigm for clustering document
datasets.

The steps of traditional fuzzy C-Medoids are summarized
below:


1. Randomly select k initial data objects as cluster centers
Θ = {Θ_1, Θ_2, ..., Θ_j, ..., Θ_k} from the input dataset
containing data objects O = {O_1, O_2, ..., O_i, ..., O_n},
with fuzzy coefficient m.

2. For each data object O_i and each cluster center Θ_j,
calculate the membership matrix μ_ij, where δ denotes the
distance between an object and a center:

$$ \mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\delta(O_i, \Theta_j)}{\delta(O_i, \Theta_l)} \right)^{\frac{2}{m-1}}} $$ (1)

3. Recalculate the cluster center Θ̂_j for the original
cluster center Θ_j from the membership values of each object
O_i ∈ O:

$$ \hat{\Theta}_j = \frac{\sum_{i=1}^{n} \mu_{ij}^{2} \, O_i}{\sum_{i=1}^{n} \mu_{ij}^{2}} $$ (2)

4. Discover medoid: find the nearest data object O_i ∈ O for
each Θ̂_j calculated using Eq. (2) and replace Θ̂_j with
O_i.
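
One serial iteration of steps 2 to 4 can be sketched as follows. The sketch assumes the standard fuzzy C-Means membership form for Eq. (1), Euclidean distance, and fuzzy coefficient m = 2 (so that the membership is squared in Eq. (2)); the function names and the toy data are hypothetical and chosen only for illustration.

```python
import math

def memberships(obj, centers, m=2.0):
    """Membership of one object in every cluster (assumed Eq. 1 form)."""
    dists = [math.dist(obj, c) for c in centers]
    # An object that coincides with a center belongs fully to that cluster.
    if 0.0 in dists:
        zero = dists.index(0.0)
        return [1.0 if j == zero else 0.0 for j in range(len(centers))]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dl) ** exp for dl in dists) for dj in dists]

def fuzzy_c_medoids_step(objects, centers, m=2.0):
    """One iteration: memberships, weighted centroids, then nearest objects."""
    mu = [memberships(o, centers, m) for o in objects]
    dim = len(objects[0])
    medoids = []
    for j in range(len(centers)):
        weights = [mu[i][j] ** m for i in range(len(objects))]
        total = sum(weights)
        # Eq. 2: membership-weighted mean of all objects.
        centroid = tuple(
            sum(w * o[d] for w, o in zip(weights, objects)) / total for d in range(dim)
        )
        # Step 4: replace the centroid with the nearest real data object.
        medoid = min(objects, key=lambda o: math.dist(o, centroid))
        if medoid not in medoids:
            medoids.append(medoid)
    return medoids

objects = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = [(0.0, 1.0), (10.0, 11.0)]
medoids = fuzzy_c_medoids_step(objects, centers)
```

Starting from the centers (0, 1) and (10, 11), the weighted centroids fall near (0.33, 0.34) and (10.33, 10.34), so the medoids become the objects (0, 0) and (10, 10).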


It is crucial to design the mapper and reducer of a
MapReduce job such that the parallel parts are assigned to
the mapper and the serial parts to the reducer. If this is
accomplished properly, the designed algorithm can be
efficient. We observed that in fuzzy C-Medoids the parallel
part is the membership calculation (Eq. 1), as each mapper
can calculate the membership values of the objects in its
split if provided with the centroids. The sequential part is
the centroid recalculation (Eq. 2) and the discovery of the
medoid, which a reducer can compute if provided with the
objects and their corresponding membership values for a
centroid. Thus, we have designed the membership calculation
in the mappers and the centroid recalculation and medoid
discovery in the reducer. The proposed MapReduce-based fuzzy
C-Medoids is provided in Fig. 2.

Figure 2 shows the mapper calculating the membership value
(μ_ij) of each data object (O_i ∈ O) for each centroid
(Θ_j ∈ Θ) (line 5). The μ_ij is squared in the mapper
(line 6), although the squared membership value (μ²_ij) is
only used in the reducer during the recalculation of the
centroid Θ_j → Θ̂_j using Eq. (2). This is to load the
reducer with as little computation as possible. The
concatenation of O_i and μ²_ij, for each Θ_j, is stored in a
list (ū_ij) (line 7). Each entry of this list holds only one
object and its corresponding μ²_ij for a centroid. Another
list (U_j) is maintained to hold the concatenation of each
centroid Θ_j ∈ Θ with the list of pairs of all O_i ∈ O and
their corresponding μ²_ij (line 9). This list is required
because the same slave used as a mapper can later be chosen
as a reducer by Hadoop; the major calculation of the mapper,
i.e., the membership calculation, is therefore completed in
one straight pass. It is also feasible to pass each pair of
the list one at a time to a reducer using another procedure
written outside the mapping procedure in the mapper.
Considering the size of document big data, this list is
quite large, making it infeasible to output it directly to
the reducer. Hence, from the list, the centroid number
(Θ_nj) is extracted (line 13), and the list of pairs of all
objects O_i ∈ O and μ²_ij (ū_ij) is extracted (line 14).
Finally, Θ_nj and each pair of an object O_i and its
corresponding μ²_ij are sent to the reducer one by one
(line 16).

The reducer works in 3 phases: sorting, shuffling and
reducing. The sorting and shuffling phases automatically
arrange the mapper output values, i.e., the concatenations
of O_i and the corresponding μ²_ij, for each Θ_nj and feed
them to the reducer code. The reducer extracts these O_i and
corresponding μ²_ij values (lines 6 and 7) and recalculates
each centroid in two parts using Eq. (2). For each Θ_nj, the
numerator and denominator parts of Eq. (2) are calculated
(lines 8 and 9, respectively) to obtain the final
recalculated centroid value Θ̂_j. The newly calculated
centroids are stored in a list named Θ̂ (line 12). Then, for
each newly calculated centroid obtained using the fuzzy
rule, we find the nearest data object as the medoid for that
particular centroid (line 16) and replace the centroid with
that medoid (line 17). The newly discovered medoids are
stored in a list Θ̃ while eliminating duplicates, if any
(line 22). The reducer finally releases all of the medoids
as output (line 26). This output is stored in the HDFS of
the master as an output file named part-r-00000. This part
file is used as the centroids for the next iteration with
the same mapper and reducer workflow.


Fig. 2 The MapReduce design of proposed fuzzy C-Medoids for
document clustering

Mapper and Reducer Pseudo Code for Text Clustering using
Fuzzy KMedoids

Algorithm for Mapper

Input: A set of document objects transformed in numeric
vector form O = {O_1, O_2, ..., O_i, ..., O_n}, randomly
chosen k initial cluster centers Θ = {Θ_1, Θ_2, ..., Θ_j,
..., Θ_k}. <key, value> pair input is taken where key is
offset and value is object. Each mapper also gets access to
the initial/last iteration outputted centroids.

Output: An intermediate <key, value> pair of (Θ_nj, ū_ij)
where 1 ≤ j ≤ k

 1. μ_ij = 0
 2. <list>ū_ij, <list>U_j = NULL
 3. FOR each Θ_j ∈ Θ DO
 4.    FOR each input.value (O_i ∈ O) DO
 5.       Calculate μ_ij using Eq. (1)
 6.       μ²_ij = μ_ij * μ_ij
 7.       <list>ū_ij = O_i.concat(μ²_ij)
 8.    END FOR
 9.    <list>U_j = Θ_j.concat(ū_ij)
10. END FOR
11. FOR each U_j ∈ <list>U_j DO
12.    Θ_j = extract.centroid(U_j)
13.    Θ_nj = extract.centroidnumber(Θ_j)
14.    <list>ū_ij = extract.MemMatrix(U_j)
15.    FOR each ū_ij ∈ <list>ū_ij DO
16.       release(Θ_nj, ū_ij)
17.    END FOR
18. END FOR

Algorithm for Reducer

Input: <Key, Value> pair of <Θ_nj, <list>ū_ij> outputted by
M mappers, where Θ_nj is the centroid number of Θ_j and ū_ij
is the concatenated list of mapper input O_i ∈ O and its
corresponding μ²_ij value. This <list>ū_ij is obtained by
the sorting and shuffling phase of the reducer on ū_ij for
each Θ_nj.

Output: <Key, Value> pair of (NULL, Θ̃) where NULL is a
space and Θ̃ are the newly calculated medoid values.

 1. N = 0
 2. d = 0
 3. Θ̂ = NULL
 4. FOR each input.key (Θ_nj) DO
 5.    FOR each input.value (ū_ij ∈ <list>ū_ij) DO
 6.       O_i = extract.ObjectValue(ū_ij)
 7.       μ²_ij = extract.MembershipValue(ū_ij)
 8.       N += O_i * μ²_ij
 9.       d += μ²_ij
10.    END FOR
11.    Θ̂_j = N / d
12.    Θ̂.add(Θ̂_j)
13. END FOR
14. FOR each Θ̂_j ∈ Θ̂ DO
15.    FOR each O_i ∈ O DO
16.       IF (dis = δ²(O_i, Θ̂_j) < ε) THEN
17.          Θ̃_j = O_i
18.          ε = dis
19.       END IF
20.    END FOR
21.    IF (Θ̃_j ∉ Θ̃ OR Θ̃ = NULL)
22.       Θ̃.add(Θ̃_j)
23.    END IF
24. END FOR
25. FOR each Θ̃_j ∈ Θ̃ DO
26.    release(NULL, Θ̃_j)
27. END FOR
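
The mapper and reducer split described above can be imitated on one machine with the following Python sketch. It is not Hadoop code and not the authors' implementation: the shuffle and sort phase is replaced by an in-memory dictionary, m = 2 and Euclidean distance are assumed, duplicate medoids are removed in the driver rather than inside the reducer, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def mapper(split, centers):
    """Emit <centroid number, (object, squared membership)> for one data split."""
    for obj in split:
        dists = [math.dist(obj, c) for c in centers]
        for j, dj in enumerate(dists):
            if 0.0 in dists:
                # Object coincides with a center: full membership there.
                mu = 1.0 if dj == 0.0 else 0.0
            else:
                # Assumed Eq. 1 form with m = 2, so the exponent is 2.
                mu = 1.0 / sum((dj / dl) ** 2 for dl in dists)
            # Squaring here keeps the reducer's work small.
            yield j, (obj, mu * mu)

def reducer(centroid_number, values):
    """Recompute one centroid (Eq. 2) and snap it to the nearest object."""
    dim = len(values[0][0])
    numerator = [0.0] * dim
    denominator = 0.0
    for obj, mu_sq in values:
        for d in range(dim):
            numerator[d] += obj[d] * mu_sq
        denominator += mu_sq
    centroid = tuple(n / denominator for n in numerator)
    return min((obj for obj, _ in values), key=lambda o: math.dist(o, centroid))

def run_iteration(splits, centers):
    grouped = defaultdict(list)          # stands in for shuffle and sort
    for split in splits:                 # each split would go to its own mapper
        for key, value in mapper(split, centers):
            grouped[key].append(value)
    medoids = []
    for key in sorted(grouped):
        medoid = reducer(key, grouped[key])
        if medoid not in medoids:        # drop duplicate medoids
            medoids.append(medoid)
    return medoids

splits = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)],
]
medoids = run_iteration(splits, centers=[(0.0, 1.0), (10.0, 11.0)])
```

The returned medoids play the role of the part-r-00000 file: they would be passed back as `centers` for the next iteration.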


Experimental Results


Intensive experiments are conducted to evaluate the
efficiency of MapReduce-based fuzzy C-Medoids for document
clustering. The execution time of the MapReduce-based
algorithm is also compared across different document dataset
sizes and different sizes (nodes) of Hadoop clusters. The
Hadoop cluster is built from commodity computing nodes, each
configured with an AMD Ryzen 3 PRO 2200G with Radeon Vega
graphics, 8 GB RAM (DDR3) and a 500 GB HDD with a measured
bandwidth of 100 MB/s. Hadoop is installed on the Ubuntu
18.04 LTS operating system. The document data consists of a
standard and popular dataset named 20_newsgroups, along with
five more crafted datasets. After preprocessing (dictionary
term ids with tf*idf values), the 20_newsgroups dataset
becomes an 80 MB dataset when 10,000 dictionary terms are
chosen. In addition, 5 datasets are crafted by crawling web
news and articles. The crawled data consists of 20 folders
containing unstructured text files of different news and
articles from the domains of books, business, cars, crime,
culture, economy, education, finance, health, Indian cities,
law, life, marketing, opinion, politics, science, sports,
tech, travel and world. The dataset is enriched gradually
with news and articles for specific folders. In this
process, the dataset is preprocessed so as to obtain
preprocessed datasets of sizes 100 MB, 250 MB, 500 MB,
750 MB and 1 GB. To make the datasets uniform, all these
different-sized datasets consist of 10,000 dimensions. These
datasets are experimented with on Hadoop clusters of 3, 5, 8
and 10 nodes. The datasets are also experimented with using
a pseudo-distributed single-node Hadoop cluster with 1
master and 1 slave feature. The differences in Hadoop
cluster size and dataset size are used to evaluate the
efficiency of the proposed algorithm. The experiments are
conducted using 3 iterations of the proposed algorithm, with
20 centroids used for all the experiments.
An effective technique for comparing two quantities is ratio
calculation. A ratio expresses how many times one numeric
value is contained within another. To critically evaluate
the performance of the proposed algorithm with different
Hadoop cluster and dataset sizes, the 10-node cluster
execution time is taken as the unit of comparison, and the
ratio of the execution time of the other configurations to
it is calculated. Table 1 provides the execution time in
seconds for the different execution strategies, i.e., for
different datasets on different Hadoop clusters. It can be
observed that the bigger the Hadoop cluster, the better the
performance. For example, a 10-node cluster performs
clustering jobs almost 3 times faster than a single-node
cluster and almost 2.5 times faster than a 3-node cluster
for a 1 GB document dataset. This result follows from the
design of the algorithm. In MapReduce-based fuzzy C-Medoids,
the mapper performs the membership calculation for each
centroid against all of the objects it receives in its data
split. The reducer performs the recalculation of each
centroid by summing the objects weighted by their respective
squared memberships and then dividing by the sum of all
squared membership values received from all the objects for
this centroid. This reducer operation in fuzzy C-Medoids
takes significantly higher execution time. It should also be
considered that the dataset has 10,000 dimensions, apart
from being large in size. That is, while calculating the
many distances between centroids and objects in the mappers,
a lot of comparisons take place to match the same
dimensions, and many loops have to be executed to obtain the
distance values. If the objective were to cluster simple
1-dimensional data points in datasets of the same size, the
execution time would be substantially lower than in our
experiments.

Table 2 provides an alternative view of the experiments.
Here, the ratio shows how many times longer a larger dataset
takes than the 20_newsgroups dataset using the proposed
algorithm in a particular Hadoop cluster setup. For example,
for the 10-node Hadoop cluster setup, it is almost 17 times
more time-consuming to cluster the 1 GB dataset than the
standard 20_newsgroups dataset, although the 1 GB dataset is
only about 13 times larger than the 20_newsgroups dataset.
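
The two ratio views can be recomputed directly from the reported execution times. The snippet below is a small illustration; the dictionary layout and the function names `cluster_ratio` and `dataset_ratio` are assumptions made for the example, while the numbers are the execution times in seconds from Tables 1 and 2.

```python
# Execution times in seconds, transcribed from Table 1.
times = {
    "pseudo":  [552, 670, 2371, 6452, 8827, 11074],
    "3-node":  [499, 585, 1941, 5073, 7982, 9141],
    "5-node":  [388, 502, 1569, 4227, 5662, 7432],
    "8-node":  [285, 346, 1140, 3048, 4518, 5871],
    "10-node": [218, 239, 826, 2225, 3012, 3716],
}
datasets = ["20_newsgroups", "100 MB", "250 MB", "500 MB", "750 MB", "1 GB"]

def cluster_ratio(setup, dataset):
    """Table 1 view: time on `setup` relative to the 10-node cluster."""
    i = datasets.index(dataset)
    return times[setup][i] / times["10-node"][i]

def dataset_ratio(setup, dataset):
    """Table 2 view: time for `dataset` relative to 20_newsgroups on `setup`."""
    i = datasets.index(dataset)
    return times[setup][i] / times[setup][0]

pseudo_vs_10 = cluster_ratio("pseudo", "1 GB")      # about 2.98
three_vs_10 = cluster_ratio("3-node", "1 GB")       # about 2.46
one_gb_vs_news = dataset_ratio("10-node", "1 GB")   # about 17
```

These values match the "almost 3 times," "almost 2.5 times" and "almost 17 times" figures quoted in the text.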

To analyze the performance of the proposed algorithm, data
scale-up is depicted using the line graph in Fig. 3. Data
scale-up provides a clear portrayal of the gain in
performance across datasets and Hadoop cluster sizes. This
enables us to evaluate MapReduce performance. The horizontal
axis of the line graph in Fig. 3 represents the different
data sizes used for the experiments, and the vertical axis
represents the execution time obtained from the different
executions of fuzzy C-Medoids. The lines in the graph are
colored, and each colored line represents a particular type
of fuzzy C-Medoids execution used in the experiments.

The tables and line graph show that the proposed algorithm
works well for complex preprocessed document datasets and is
able to cluster them efficiently. It can also be seen that
the Hadoop cluster works well for the standard dataset as
well as the larger datasets. It is also observed from the
graph that the scale-up factor is not linear. That is, equal
growth in dataset size does not produce equal growth in
execution time. For example, the 10-node cluster took 2225 s
to cluster the 500 MB dataset but 3716 s to cluster the 1 GB
dataset. Thus, doubling the dataset size increases the time
taken by a factor of 1.67, whereas ideal linear scaling
would give a factor of 2, as the dataset is increased 2
times. The same can be seen for every cluster configuration.
This happens for 2 reasons. The first reason is that the
design of fuzzy C-Medoids is such that the reducer also has
to perform a significant
Table 1 Execution time and performance ratio for different execution strategies

MapReduce fuzzy C-Medoids execution time in seconds

Dataset size     Pseudo-        3-node     5-node     8-node     10-node
and ratio        distributed    cluster    cluster    cluster    cluster
20_newsgroups    552            499        388        285        218
Ratio            2.52           2.29       1.78       1.31       1
100 MB           670            585        502        346        239
Ratio            2.8            2.45       2.1        1.45       1
250 MB           2371           1941       1569       1140       826
Ratio            2.87           2.35       1.9        1.38       1
500 MB           6452           5073       4227       3048       2225
Ratio            2.89           2.28       1.9        1.37       1
750 MB           8827           7982       5662       4518       3012
Ratio            2.93           2.65       1.88       1.5        1
1 GB             11,074         9141       7432       5871       3716
Ratio            2.98           2.46       2          1.58       1


Table 2 Alternative view: performance ratio with respect to different execution strategies

Node(s)               Dataset
                      20_newsgroups   100 MB   250 MB   500 MB   750 MB   1 GB
Pseudo-distributed    552             670      2371     6452     8827     11,074
Ratio                 1               1.21     4.3      11.69    16       20
3-Node cluster        499             585      1941     5073     7982     9141
Ratio                 1               1.17     3.89     10.17    16       18.31
5-Node cluster        388             502      1569     4227     5662     7432
Ratio                 1               1.29     4        10.9     14.59    19.15
8-Node cluster        285             346      1140     3048     4518     5871
Ratio                 1               1.21     4        10.69    15.85    20.6
10-Node cluster       218             239      826      2225     3012     3716
Ratio                 1               1.09     3.79     10.20    13.81    17.04
Fig. 3 Line graph of data scale-up (MapReduce-based fuzzy
C-Medoids; horizontal axis: dataset size from 20_newsgroups
to 1 GB; vertical axis: execution time in seconds; one line
each for the pseudo-distributed, 3-node, 5-node, 8-node and
10-node setups)


amount of calculations. This load is significantly high for
preprocessed document datasets with many dimensions. The
second reason is that a Hadoop cluster generally has
considerable network and process overhead, which sometimes
consumes a significant share of a node's time and affects
the overall efficiency of the cluster.
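
The scale-up figure quoted above can be checked for every cluster configuration with a short calculation. The snippet is only an illustration of the arithmetic; the variable names are arbitrary and the execution times are those reported in Table 1 for the 500 MB and 1 GB datasets.

```python
# Execution times in seconds for the 500 MB and 1 GB datasets (Table 1).
t_500mb = {"pseudo": 6452, "3-node": 5073, "5-node": 4227, "8-node": 3048, "10-node": 2225}
t_1gb = {"pseudo": 11074, "3-node": 9141, "5-node": 7432, "8-node": 5871, "10-node": 3716}

# Growth in execution time when the dataset size is doubled.
growth = {setup: t_1gb[setup] / t_500mb[setup] for setup in t_500mb}

for setup, factor in growth.items():
    print(f"{setup}: {factor:.2f}x time for 2x data")
```

For the 10-node cluster the factor is 3716 / 2225, about 1.67, and for every configuration it stays below the factor of 2 that linear scaling would give.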

The work of MapReduce-based fuzzy C-Medoids for document
clustering is novel, as there is no such work reported in
the literature. This makes us unable to compare the work
against other MapReduce-based fuzzy C-Medoids works for
document clustering. However, we have chosen to compare the
work using two different strategies: (1) the MapReduce-based
fuzzy C-Medoids is compared, on the same datasets, against
the traditional serial fuzzy C-Medoids algorithm, and (2)
the work is compared with a MapReduce-based K-Means
algorithm that was also experimented with on similar
document datasets and reported recently, in 2020 [33]. We
elaborate on these two comparisons with our proposed work in
the following sections.


Comparison of Proposed MapReduce-based Fuzzy 
C-Medoids against Traditional Serial Fuzzy 
C-Medoids Algorithm 


To comprehend the efficiency benefit of the proposed 
MapReduce-based fuzzy C-Medoids algorithm, the work 
of traditional serial fuzzy C-Medoids algorithm is designed 
and experimented with the same Hadoop clusters and same 
document datasets. The work produces a similar data 
table of Table 1 except a new column of data of traditional 
fuzzy C-Medoids (highlighted using a different color, the 
2nd column from left) and is provided in Table 3. 

Table 3 shows that the proposed MapReduce-based fuzzy C-Medoids
algorithm is faster than the traditional version of the algorithm on
the 3-node, 5-node, 8-node and 10-node Hadoop clusters. The larger
the Hadoop cluster, the better the performance. For example, on the
20 newsgroups dataset (80 MB), the proposed algorithm on the 10-node
Hadoop cluster is 2.29 times faster than on the 3-node cluster and
2.51 times faster than the traditional version. On the 1 GB dataset,
the 10-node execution is 2.92 times faster than the traditional
version. The proposed MapReduce-based fuzzy C-Medoids algorithm is
therefore more efficient on larger datasets.

Table 3 also shows that the traditional fuzzy C-Medoids algorithm is
slightly faster than the pseudo-distributed execution of the
algorithm on every dataset. The reason is that, in pseudo-distributed
Hadoop execution, the overhead of running the master and slave
daemons, together with running the mappers and reducers on the same
node, consumes a considerable amount of time. This confirms that the
pseudo-distributed mode of execution is suitable only for testing
purposes. On multi-node clusters, the proposed MapReduce-based fuzzy
C-Medoids is hence more efficient and scalable than the traditional
fuzzy C-Medoids.
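
The ratios quoted in this comparison can be reproduced from the
execution times in Table 3. The short Python sketch below does so for
the 20 newsgroups and 1 GB rows. The function name `speedup` and the
dictionary layout are illustrative choices, not part of the original
work; only the timings are taken from the table.

```python
# Execution times in seconds, copied from Table 3.
# The dictionary layout and key names are illustrative assumptions.
TABLE3 = {
    "20_newsgroups": {"traditional": 547, "3-node": 499, "10-node": 218},
    "1 GB": {"traditional": 10851, "3-node": 9141, "10-node": 3716},
}


def speedup(slower_s: float, faster_s: float) -> float:
    """Return how many times faster the second run is, rounded to 2 decimals."""
    return round(slower_s / faster_s, 2)


for name, t in TABLE3.items():
    vs_3_node = speedup(t["3-node"], t["10-node"])
    vs_serial = speedup(t["traditional"], t["10-node"])
    print(f"{name}: {vs_3_node}x vs 3-node, {vs_serial}x vs traditional")
```

Each "Ratio" entry in the table appears to be the execution time of
that column divided by the 10-node execution time for the same
dataset, which is why the 10-node column always shows 1.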


Comparison of Proposed MapReduce-based Fuzzy 
C-Medoids against MapReduce-based K-Means 
Algorithm 


The proposed work is also compared against an existing
MapReduce-based K-Means algorithm, which was evaluated on similar
document datasets and reported in the 2020 literature [33]. That work
reports the execution times and ratios shown in Table 4.

Table 3 Execution time (in seconds) and performance ratio for different execution strategies

Dataset size     Traditional fuzzy   MapReduce-based fuzzy C-Medoids
and ratio        C-Medoids           Pseudo-distributed   3-node   5-node   8-node   10-node

20_newsgroups    547                 552                  499      388      285      218
Ratio            2.51                2.52                 2.29     1.78     1.31     1
100 MB           652                 670                  585      502      346      239
Ratio            2.73                2.8                  2.45     2.1      1.45     1
250 MB           2329                2371                 1941     1569     1140     826
Ratio            2.82                2.87                 2.35     1.9      1.38     1
500 MB           6430                6452                 5073     4227     3048     2225
Ratio            2.83                2.89                 2.28     1.9      1.37     1
750 MB           8735                8827                 7982     5662     4518     3012
Ratio            2.9                 2.93                 2.65     1.88     1.5      1
1 GB             10,851              11,074               9141     7432     5871     3716
Ratio            2.92                2.98                 2.46     2        1.58     1

Table 4 Performance of MapReduce-based K-Means algorithm as in [33]

Dataset size     MapReduce-based K-Means execution time in seconds
and ratio        Pseudo-distributed   3-node   5-node   8-node   10-node

100 MB           151                  140      120      79       52
Ratio            2.9                  2.7      2.3      1.51     1
250 MB           195                  164      140      114      87
Ratio            2.24                 1.88     1.6      2.19     1
500 MB           520                  300      176      135      110
Ratio            4.7                  2.72     1.6      1.22     1
750 MB           637                  324      274      165      133
Ratio            4.8                  2.43     2        1.24     1
1 GB             649                  403      365      180      141
Ratio            4.6                  2.85     2.58     1.27     1

Comparing the execution times and ratios in Tables 1 and 4 shows that
the MapReduce-based K-Means algorithm is considerably faster than the
proposed MapReduce-based fuzzy C-Medoids. For example, for the 100 MB
dataset, MapReduce-based fuzzy C-Medoids takes 670 s on the
pseudo-distributed cluster, whereas MapReduce-based K-Means takes
only 151 s. The same pattern can be observed for all the datasets and
Hadoop clusters used in the experimental setups of both algorithms.
The larger the dataset, the larger the difference in execution time.
Analyzing this gap, we identified the following reasons:


• The difference in hardware used to form the Hadoop cluster: AMD
Ryzen 3 PRO processors are used in the Hadoop cluster for the
execution of the MapReduce-based fuzzy C-Medoids algorithm, whereas
Intel Core 2 Duo processors are used for the MapReduce-based K-Means
algorithm. The RAM and network connectivity configuration is the same
for both algorithms. The Intel Core 2 Duo processors may be the
reason behind the somewhat greater processing capacity of the
MapReduce-based K-Means algorithm.

• The algorithm structures: The key reason lies in the structural
difference between the two algorithms. As described in the
methodology section of [33], the MapReduce-based K-Means algorithm is
quite simple. It calculates the distance between centroids and data
points in the mapper, and the centroids are recalculated by summing
the values emitted by the mappers for each centroid and dividing by
the number of objects assigned to that centroid. The MapReduce-based
fuzzy C-Medoids algorithm, on the other hand, has a very different
structure. It performs the highly time-consuming membership
calculation (which involves multiple distance computations) in the
mappers. The mappers output large strings of centroid-wise membership
matrix values and the corresponding objects to the reducers, a step
that consumes considerable network time. The reducer then recomputes
the centroids using Eq. (2), which combines multiplication and
summation with distance calculation, so the reducer also spends a
large amount of time on this equation. The reducer then takes these
intermediate centroids and performs another round of distance
calculations to find the object nearest to each centroid value, which
replaces it. However, this extra time is expected to yield
better-quality clustering output, since fuzzy clustering is widely
reported to perform better for real-world datasets. It is also widely
known in the research community that fuzzy clusters are good in
quality even for noisy datasets.
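
The structural difference described above can be illustrated with a
minimal Python sketch of the per-object work in each mapper. This is
not the authors' implementation: it assumes Euclidean distance and the
standard fuzzy membership formula with fuzzifier m, which may differ
from the equations used in this work, and the function names
`kmeans_assign` and `fuzzy_memberships` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_assign(point, centroids):
    """K-Means style step: return the index of the single nearest centroid."""
    distances = [euclidean(point, c) for c in centroids]
    return distances.index(min(distances))


def fuzzy_memberships(point, medoids, m=2.0):
    """Fuzzy step: return one membership value per cluster.

    Assumes the standard formula u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)).
    """
    distances = [euclidean(point, c) for c in medoids]
    zeros = distances.count(0.0)
    if zeros:
        # The point coincides with one or more medoids: share membership
        # equally among them and give zero to all other clusters.
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
print(kmeans_assign(point, centers))      # one cluster index per object
print(fuzzy_memberships(point, centers))  # one value per cluster per object
```

Under these assumptions, the K-Means step produces a single cluster
index per object, while the fuzzy step produces one membership value
for every cluster and evaluates a sum over all clusters for each of
them. This is consistent with the larger mapper output and the longer
computation time discussed above.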


Conclusion and Future Work 


Big data requires a novel design of the clustering algorithm and
computing architecture. In this work, a novel MapReduce-based fuzzy
C-Medoids algorithm is designed and evaluated on Hadoop clusters of
varying size. The datasets used in the experiments consist of both
standard datasets and self-crafted datasets up to 1 GB in size. The
extensive experimental results and their analysis show that the
proposed algorithm scales well for massive datasets and different
Hadoop cluster sizes. The analysis of the experimental data shows
that the performance gain is uneven across cluster sizes and dataset
sizes due to Hadoop cluster overhead and algorithmic design. The
observations also indicate that the larger the Hadoop cluster, the
better the proposed algorithm performs. The main contribution of this
work lies in the experimentation of the proposed algorithm on big
data of document type for clustering. To our knowledge, no work
reported in the literature provides such a detailed design of a
MapReduce-based fuzzy C-Medoids algorithm together with such
large-scale experiments over many datasets, each with 10,000
dimensions, and different Hadoop clusters.

As future work, the proposed algorithm can be extended to a
MapReduce-based fuzzy C-least medians algorithm. The fuzzy C-least
medians design can then be evaluated extensively for performance gain
with the same dataset sizes and Hadoop cluster sizes. Further, the
cluster quality of MapReduce-based fuzzy C-Medoids and
MapReduce-based fuzzy C-least medians can be analyzed and compared.
The work can also be enhanced with a quality-wise comparison between
the MapReduce-based fuzzy C-Medoids and MapReduce-based K-Means
algorithms using different metrics.